Piranha: Optimizing Short Jobs in Hadoop

نویسنده

  • Khaled Elmeleegy
چکیده

Cluster computing has emerged as a key parallel processing platform for large scale data. All major internet companies use it as their major central processing platform. One of cluster computing’s most popular examples is MapReduce and its open source implementation Hadoop. These systems were originally designed for batch and massive-scale computations. Interestingly, over time their production workloads have evolved into a mix of a small fraction of large and long-running jobs and a much bigger fraction of short jobs. This came about because these systems end up being used as data warehouses, which store most of the data sets and attract ad hoc, short, data-mining queries. Moreover, the availability of higher level query languages that operate on top of these cluster systems proliferated these ad hoc queries. Since existing systems were not designed for short, latency-sensistive jobs, short interactive jobs suffer from poor response times. In this paper, we present Piranha—a system for optimizing short jobs on Hadoop without affecting the larger jobs. It runs on existing unmodified Hadoop clusters facilitating its adoption. Piranha exploits characteristics of short jobs learned from production workloads at Yahoo! clusters to reduce the latency of such jobs. To demonstrate Piranha’s effectiveness, we evaluated its performance using three realistic short queries. Piranha was able to reduce the queries’ response times by up to 71%.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Measuring the Optimality of Hadoop Optimization

In recent years, much research has focused on how to optimize Hadoop jobs. Their approaches are diverse, ranging from improving HDFS and Hadoop job scheduler to optimizing parameters in Hadoop configurations. Despite their success in improving the performance of Hadoop jobs, however, very little is known about the limit of their optimization performance. That is, how optimal is a given Hadoop o...

متن کامل

Design and Implementation of a Two Level Scheduler for HADOOP Data Grids

-----------------------------------------------------------------------------ABSTRACT------------------------------------------------------------------------Hadoop is a large scale distributed processing infrastructure designed to handle data intensive applications. In a commercial large scale cluster framework, a scheduler distributes user jobs evenly among the cluster resources. The proposed ...

متن کامل

Dynamic Proportional Share Scheduling in Hadoop

We present the Dynamic Priority (DP) parallel task scheduler for Hadoop. It allows users to control their allocated capacity by adjusting their spending over time. This simple mechanism allows the scheduler to make more efficient decisions about which jobs and users to prioritize and gives users the tool to optimize and customize their allocations to fit the importance and requirements of their...

متن کامل

FMEM: A Fine-grained Memory Estimator for MapReduce Jobs

MapReduce is designed as a simple and scalable framework for big data processing. Due to the lack of resource usage models, its implementation Hadoop hands over resource planning and optimizing works to users. But users also find difficulty in specifying right resource-related, especially memory-related, configurations without good knowledge of job’s memory usage. Modeling memory usage is chall...

متن کامل

Size-based disciplines for job scheduling in data-intensive scalable computing systems. (Disciplines basées sur la taille pour la planification des jobs dans data-intensif scalable computing systems)

The past decade have seen the rise of data-intensive scalable computing (DISC) systems, such as Hadoop, and the consequent demand for scheduling policies to manage their resources, so that they can provide quick response times as well as fairness. Schedulers for DISC systems are usually focused on the fairness, without optimizing the response times. The best practices to overcome this problem i...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • PVLDB

دوره 6  شماره 

صفحات  -

تاریخ انتشار 2013